Ideas and Challenges for using Synthetic Data to Augment Clinical Data

Laura Thompson, Ph.D.
Senior Mathematical Statistician
CDER/OTS/OB/DBIV/DRD
Food and Drug Administration

Bayesian Biostatistics 2024, Rockville, MD – USA

October 24, 2024

Disclaimer: This presentation reflects my views and should not be construed to represent FDA’s views or policies.

Challenges in rare disease studies


  • Rare diseases pose challenges for conducting clinical trials, primarily due to the small patient population size. Thus, it may be difficult to recruit enough participants for adequately powered studies.

  • Traditional randomized controlled trials (RCTs) may be infeasible or impractical in rare disease research.

  • Small datasets can lead to imprecise parameter estimates and limit the power of statistical analyses.

The emergence of generative AI and virtual patients


  • We are starting to see more examples of using generative models to create clinical data

    • Generative models for imaging data

      See Sizikova & CDRH colleagues (2024)[1] (GANs, diffusion models, deconvolutional models, VAEs, etc.)

    • AnimalGAN from NCTR (Chen et al. 2023[2]) –

      generation of synthetic clinical pathology measurements to assess toxicology of untested chemicals on animals

    • Digital Twins

      Unlearn.AI’s PROCOVA (prognostic covariate adjustment model)[3]

Some Past Bayesian Methods for Generating Synthetic Data

  • Synthetic control (Pennello & Thompson, 2008)[4]

  • Priors constructed using in silico models:

    • Stochastic engineering models to create virtual patients for prior information in medical device studies (Haddad et al. & MDIC, 2017[5])
    • In-silico models of biological systems (Kiagias et al. 2021[6])
  • Prior distributions on latent weight parameters of DL models:

    • Bayesian Variational Autoencoders (VAEs) (Kingma & Welling, 2014)
    • Bayesian Generative Adversarial Networks (Saatchi & Wilson, 2017[7]) and Variational Bayes GAN (Chien & Kuo, 2019[8])
    • Bayesian (generative?) transformer models (~2020 - 2022)

Challenges with synthetic data


  • Challenge 1: How can synthetic data be generated such that they are in some sense exchangeable (i.e., interchangeable) with real data?
  • Challenge 2: How can we represent uncertainty in the generated synthetic data?
  • Challenge 3: How do we borrow strength from synthetic data to estimate drug/device/biologic performance on real patients?

Bayesian methods for synthetic data generation


  • Challenge 1: Can naturally incorporate prior information into the generative model (e.g., as an informative prior distribution on the model weights, as conditional information for the model). This may help generate data similar to real data.
  • Challenge 2: Can incorporate uncertainty into the data generation process by using prior distributions on model parameters (leading to posterior distributions on the parameters).

    • Can generate diverse instances by sampling different parent parameters from the posterior distribution, then sampling instances given the parameters.

    • Traditional synthetic data generation methods may rely on point estimates of parameters and thus not fully capture the underlying data distribution.
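
The two-stage sampling just described can be sketched in a few lines of NumPy. The toy posterior below (a normal posterior on a single mean parameter) is an illustrative assumption, not anything from the talk; it only shows why posterior-predictive draws are more dispersed than draws conditioned on a point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior" for a single mean parameter: mu | data ~ N(1.0, 0.3^2),
# with observation noise SD 1.0. All numbers are illustrative.
post_mean, post_sd, obs_sd = 1.0, 0.3, 1.0

# Point-estimate generator: every synthetic draw conditions on the posterior mean.
point_draws = rng.normal(post_mean, obs_sd, size=10_000)

# Posterior-predictive generator: first sample a parent parameter from the
# posterior, then sample an observation given that parameter.
mus = rng.normal(post_mean, post_sd, size=10_000)
pred_draws = rng.normal(mus, obs_sd)

# The predictive draws are more dispersed because they propagate parameter
# uncertainty: Var = obs_sd^2 + post_sd^2.
print(point_draws.std(), pred_draws.std())
```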

Challenge 3: How do we borrow strength from synthetic data to estimate product performance on real patients?


  • Bayesian hierarchical models (BHMs) are natural frameworks for combining data across sources

    • When estimating a parameter associated with the real data, a hierarchical model could borrow strength from synthetic data by assuming the parameters for the synthetic and real data are exchangeable, i.e., iid “draws” from same super-population.
    • Borrowing may improve the precision of the estimated parameter for the real data as the variation in parameters between synthetic and real data decreases.
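
As a rough sketch of this kind of borrowing, the normal–normal shrinkage below treats study-level estimates (10 synthetic "prior" studies plus one real study) as exchangeable draws around a common mean. Holding the between-study SD fixed is a simplification of a full BHM, and all the numbers are illustrative assumptions.

```python
import numpy as np

# Study-level estimates: 10 synthetic "prior" studies plus one real study
# (last entry). Estimates, standard errors, and tau are illustrative only.
theta_hat = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 0.95, 1.05, 1.0, 0.9, 1.1, 0.7])
se = np.full(11, 0.2)   # within-study standard errors
tau = 0.1               # assumed between-study SD (degree of exchangeability)

# Normal-normal conjugate shrinkage: each study's parameter is pulled toward
# the precision-weighted grand mean of all studies.
w = 1.0 / (se**2 + tau**2)
grand_mean = np.sum(w * theta_hat) / np.sum(w)
shrink = se**2 / (se**2 + tau**2)                    # weight on the grand mean
post_mean = (1 - shrink) * theta_hat + shrink * grand_mean
post_sd = np.sqrt(1.0 / (1.0 / se**2 + 1.0 / tau**2))

# The real study's estimate (index -1) gains precision by borrowing:
print(post_mean[-1], post_sd[-1], "vs unborrowed SE", se[-1])
```

As tau shrinks (synthetic and real parameters more similar), the posterior SD for the real study drops further, matching the bullet above.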

Evaluating comparability of synthetic data to real source data


  • How well do the synthetic data capture statistical properties and relationships present in the real data?
  • Are the resulting synthetic dataset(s) exchangeable with the real dataset?

Evaluating comparability of synthetic data to source


  • Traditional metrics to assess comparability
    • Qualitative: Compare latent representations of real versus synthetic data (e.g., PCA, t-SNE, UMAP) to ensure that the generated datasets capture properties of the source data distribution.
    • Quantitative: Compute cosine similarity or KS-test per variable across datasets
    • Quantitative: Train a classifier to predict real versus fake samples
  • Exchangeability - more a concept rather than a measure
    • Some borrowing methods assume exchangeability (e.g., HMs)
    • Regulatory Considerations – often rely on subject-matter experts, perhaps with some down-weighting of prior data
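
A minimal sketch of the per-variable quantitative check mentioned above, using a hand-rolled two-sample Kolmogorov–Smirnov statistic. The data here are stand-ins (one well-matched and one mean-shifted synthetic sample), not the talk's datasets.

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(500, 3))        # stand-in "real" variables
synth_ok = rng.normal(0.0, 1.0, size=(500, 3))    # well-matched synthetic data
synth_off = rng.normal(0.5, 1.0, size=(500, 3))   # mean-shifted synthetic data

ks_ok = np.array([ks_stat(real[:, j], synth_ok[:, j]) for j in range(3)])
ks_off = np.array([ks_stat(real[:, j], synth_off[:, j]) for j in range(3)])
print("per-variable KS (matched):", ks_ok)
print("per-variable KS (shifted):", ks_off)
```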

Proof-of-concept Exercise*:

Generate synthetic data and combine it with “real” (simulated) data

*I do not necessarily think this is ideal

Example: Bayesian GAN to generate synthetic data


  • GANs are deep neural network architectures consisting of a generator and a discriminator, set up to compete against each other (hence the term adversarial). Given some input data:
    • the generator G tries to create truly new samples from the input data distribution, passing them on to the discriminator
    • the discriminator D receives both real and generated data and tries to determine whether what it gets is real or fake
    • G tries to fool D (make fake look real), while D in turn tries to improve at distinguishing real vs. fake
  • By placing prior distributions on the parameters of the generator and discriminator, the Bayesian GAN approximates a posterior distribution on the parameters and then generates synthetic data from the approximate posterior predictive distribution.

Bayesian GAN vs. traditional GAN


  • As opposed to learning one generator and one discriminator, it learns distributions over possible generators and discriminators. Each generator in the distribution may focus on a different latent representation of the data.
  • Due to the prior distributions on the parameters of the generator network, Bayesian GANs introduce uncertainty into the synthetic data generation process.
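
The idea of a distribution over generators can be caricatured in NumPy. Below, a "generator" is just a linear map whose weights are drawn from an assumed Gaussian posterior; a real Bayesian GAN would obtain such weight draws via SGHMC or variational inference over the network parameters. Everything here is a stand-in, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
latent_dim, data_dim = 4, 10

# Stand-in "posterior" over generator weights: a mean weight matrix plus
# Gaussian uncertainty around it.
W_mean = rng.normal(size=(latent_dim, data_dim))
W_sd = 0.3

def sample_generator():
    """Draw one generator (here, just a linear map) from the weight posterior."""
    W = rng.normal(W_mean, W_sd)
    return lambda z: z @ W

# Each posterior draw defines a different generator, hence a different
# synthetic dataset -- the source of the diversity described above.
datasets = []
for _ in range(10):
    G = sample_generator()
    z = rng.normal(size=(200, latent_dim))
    datasets.append(G(z))
print(len(datasets), datasets[0].shape)
```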

Example - BGAN (create simulated “real” data)


Example similar to the multi-modal synthetic data generation example from the original BGAN paper (Saatchi & Wilson, 2017).

\[\underset{5000 \times 30}{\mathbf{X}_1} \sim N(\mathbf{\mu}_1, \mathbf{\Sigma}) \hspace{2mm} \underset{5000 \times 30}{\mathbf{X}_2} \sim N(\mathbf{\mu}_2, \mathbf{\Sigma})\] \[\mathbf{\Sigma}: \hspace{2mm} \sigma_{ii} = 1, \hspace{1mm} \sigma_{ij} = 0.2 \] The mean vectors were either all 1s or all -1s. \[\mathbf{\mu}_1 = [1,...,1] \hspace{5mm} \mathbf{\mu}_2 = [-1,...,-1]\] I added 8 pairwise interactions to the 30 covariates:

\[\hspace{5mm} \underset{5000 \times 38}{\mathbf{X}_j} \leftarrow \underset{5000 \times 30}{\mathbf{X}_j} + \text{8 interactions}\]

I simulated a binary “response” vector using the \(\mathbf{X}\)s and a coefficient vector \(\mathbf{\beta}\) drawn from a MVN distribution with correlations of 0.2.

\[\underset{5000 \times 1}{\mathbf{y}_1} \sim \operatorname{Bern}\left(p = (1+\operatorname{exp}(\mathbf{X}_1\mathbf{\beta}))^{-1}\right)\]

\[\underset{5000 \times 1}{\mathbf{y}_2} \sim \operatorname{Bern}\left(p = (1+\operatorname{exp}(\mathbf{X}_2\mathbf{\beta}))^{-1}\right)\]

\[\underset{38 \times 1}{\mathbf{\beta}} \sim N(\mathbf{0},\Sigma) \hspace{3mm} \sigma_{ii} = 1, \hspace{1mm} \sigma_{ij} = 0.2\] Training and validation sets, with an 80/20 split.

\[ \underset{(80/20)}{\text{Training/valid set: }} \begin{bmatrix} \mathbf{X}_1 & \vdots & \mathbf{y}_1 \\ \hline \mathbf{X}_2 & \vdots & \mathbf{y}_2 \end{bmatrix} \]

  • Generator/discriminator networks were very similar to those used in the paper:

    • Generator: 2-layer fully-connected NN, 25-1000-38, with ReLU activation
    • Discriminator: 2-layer fully-connected NN, 38-1000-1, with ReLU activation
  • Test set (to be used in the combining stage) was the same structure as training set, but with 100 samples, 52/48 split between clusters.
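
The data setup above can be sketched in NumPy as follows. The talk does not say which 8 interaction pairs were used, so the adjacent-pair choice below is arbitrary; the rest follows the stated distributions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5000, 30

# Equicorrelated covariance: sigma_ii = 1, sigma_ij = 0.2.
Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X1 = rng.multivariate_normal(np.ones(p), Sigma, size=n)
X2 = rng.multivariate_normal(-np.ones(p), Sigma, size=n)

# Append 8 pairwise interactions (which pairs is not specified in the talk;
# the first 8 adjacent pairs are an arbitrary choice here).
pairs = [(2 * k, 2 * k + 1) for k in range(8)]
def add_interactions(X):
    return np.column_stack([X] + [X[:, a] * X[:, b] for a, b in pairs])
X1, X2 = add_interactions(X1), add_interactions(X2)  # now n x 38

# Coefficient vector from an equicorrelated MVN, then Bernoulli responses
# with p = (1 + exp(X beta))^(-1), matching the slide's parameterization.
q = X1.shape[1]
Sb = np.full((q, q), 0.2) + 0.8 * np.eye(q)
beta = rng.multivariate_normal(np.zeros(q), Sb)
def bern_response(X):
    eta = np.clip(X @ beta, -30, 30)  # clip to avoid overflow in exp
    return rng.binomial(1, 1.0 / (1.0 + np.exp(eta)))
y1, y2 = bern_response(X1), bern_response(X2)
print(X1.shape, y1.shape, y1.mean())
```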

Example - BGAN (Fit GAN and generate synthetic data)


  • At convergence: Generate 10 synthetic datasets, each of size 200 (to collectively match the size of the validation set), by sampling different parameter vectors from the posterior distribution over generators.
  • Each synthetic dataset originates from a different generator.

Compare generated data to source (validation) data

  • Compare learned representations across synthetic and real datasets.

  • Comparison of first 2 PCs of each of 10 synthetic datasets with validation dataset

  • Compare histograms of validation data and (all) generated data across all 39 variables

  • Despite some issues with the comparability of the synthetic data to the real validation data, we will forge ahead anyway…

Combine real and synthetic data using a BHM


  • Each of the 10 synthetic datasets generated by the BayesGAN using different generator weight samples will be treated as a hypothetical “prior” study.

  • Suppose we obtain a new dataset (the test set, here) that we want to combine with the prior studies. (Qualitative comparisons with the synthetic datasets were similar to the previous slide.)

  • In the BHM, there are 11 studies. The study-specific parameters will form level 2 of the model, and will borrow information from each other.

  • However, the data model used for observations in each study was different from that used in the generative model.

Description of Bayesian Hierarchical Model


Bayesian hierarchical logistic regression model, where the synthetic datasets (\(j=1,...,10\)) serve as prior studies for the “real” study (\(j=test\)).

\[y_{ij} \sim Bern(p_{ij}) \hspace{5mm} i=1,...,n_j = 100; \hspace{3mm} j=1,...,10,\text{test}\] \[p_{ij} = (1+\operatorname{exp}(\alpha_j +\mathbf{x}_{ij}^{T}\mathbf{\beta}_j))^{-1} \]

\[\mathbf{\beta}_j \sim N(\mathbf{0},\mathbf{\Sigma}_{\beta}), \hspace{3mm} \alpha_j \sim N(0, 10) \hspace{3mm} j=1,...,10,\text{test}\] LKJ prior on the correlation matrix (then transform back to covariance matrix \(\mathbf{\Sigma_{\beta}}\))

\[Cor(\mathbf{\beta}) = \mathbf{R}_{\mathbf{\beta}} = L_\Omega^T L_\Omega \sim LKJ\_Corr(\eta = 0.8) \propto \operatorname{det}(\mathbf{R}_{\mathbf{\beta}})^{\eta - 1}\]

We want to make inference on \(\mathbf{\beta}_{test}\).

Compare posterior estimates (CI) of \(\beta_{test}\) before and after borrowing from synthetic data

Comments on application of BHM to the example


  • The synthetic datasets from the BayesGAN may have been too diverse compared to the test dataset, resulting in minimal borrowing across coefficient vectors.

  • A model with clusters of exchangeability may be more appropriate if some synthetic datasets are more similar to the current study than others. A Dirichlet process mixture model or LEAP model (Alt et al., 2024[9]) could flexibly model alternatives to full exchangeability.

  • One could quantify how much was borrowed from synthetic datasets using prior effective sample size (PESS). PESS represents the amount of information contributed by the prior (synthetic datasets).

  • There are several proposals for computing PESS; approximation may be necessary for more complicated models. Reimherr et al. (2021)[10] provide an approximation for a “multivariate” PESS.
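
One common heuristic expresses PESS as a variance ratio: compare the posterior variance of the parameter with and without borrowing, scaled by the current-study sample size. The sketch below uses that rough heuristic with illustrative numbers; it is not the Reimherr et al. formula.

```python
def pess_variance_ratio(var_no_borrow, var_borrow, n_current):
    """Rough PESS: information gained by borrowing, in current-study subjects.

    var_no_borrow: posterior variance of the parameter using real data alone
    var_borrow:    posterior variance after borrowing from synthetic studies
    n_current:     sample size of the real (current) study
    """
    return n_current * (var_no_borrow / var_borrow - 1.0)

# Illustrative numbers: borrowing shrinks the posterior variance of beta_test
# from 0.04 to 0.025 with n = 100 real subjects, i.e. roughly 60 subjects'
# worth of prior information.
print(pess_variance_ratio(0.04, 0.025, 100))
```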

Concluding remarks


  • Bayesian versions of synthetic data generation are potentially fruitful topics for future research.

    • Due to the increasing availability of pre-trained models, these models might serve as priors to generate additional control subjects or even replace controls.
  • BHMs are natural structures for combining information across synthetic datasets and real data. But, more flexible versions may be needed.

  • Many newer generative models have natural levels of hierarchies that might be used (e.g., self-attention heads in multi-attention layers of transformers, generator networks in GANs)

  • NumPyro (a NumPy/JAX-based backend for Pyro) can be useful for fitting HMs with large amounts of simulated data.

Footnotes

  1. Sizikova et al. (2024) Synthetic data in radiological imaging: current state and future outlook, BJR|Artificial Intelligence, 1(1)

  2. Chen et al. (2023) AnimalGAN: A Generative Adversarial Network Model Alternative to Animal Studies for Clinical Pathology Assessment

  3. Walsh et al. (2021) Using digital twins to reduce sample sizes while maintaining power and statistical accuracy, Alzheimer’s Dement. 2021;17(Suppl. 9):e054657

  4. Pennello & Thompson (2008) Experience with reviewing Bayesian Medical Device Trials, Journal of Biopharmaceutical Statistics, 18:1, 81 - 115

  5. Haddad et al. (2017) Incorporation of stochastic engineering models as prior information in Bayesian medical device trials, Journal of Biopharmaceutical Statistics, DOI: 10.1080/10543406.2017.1300907

  6. Kiagias et al. (2021) Bayesian Augmented Clinical Trials in TB Therapeutic Vaccination, Front. Med. Technol. 3:719380.

  7. Saatchi and Wilson (2017) Bayesian GAN. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA

  8. Chien & Kuo (2019) Variational Bayesian GAN, 2019 27th European Signal Processing Conference (EUSIPCO)

  9. Alt et al. (2024) LEAP: the latent exchangeability prior for borrowing information from historical data, Biometrics, 80(3).

  10. Reimherr et al. (2021) Prior sample size extensions for assessing prior impact and prior-likelihood discordance, J R Stat Soc Series B, 83:413–437